Abstract- Splice junction classification in a eukaryotic cell is an
important problem because the splice junction indicates which
part of the DNA sequence carries protein-coding information.
The major issue in building a classifier for this classification task
is  how  to  represent  the  DNA  sequence  on  computers  since  the 
accuracy  of  any  classification  technique  critically  hinges  on  the 
adopted  representation.  This  paper  presents  the  experimental 
results  on  seven  representation  schemes.  The  first  three 
representations  interpret  each  DNA  sequence  as  a  series  of 
symbols.  The  fourth  and  fifth  representations  consider  the 
sequence  as  a  series  of  real  numbers.  Moreover,  the  first,  second 
and  fourth  representations  do  not  consider  the  influence  of  the 
neighbors  on  the  occurrence  of  a  nucleotide,  whereas  the  third 
and fifth representations take the influence of the neighbors into
consideration. To capture certain regularities in the apparent
randomness of the DNA sequence, the sixth representation treats
the sequence as a variant of a random walk. The seventh
representation uses the Hurst coefficient, which quantifies the
roughness of the DNA sequence. The experimental results
suggest that the fourth representation scheme keeps sequences
from the same class close together and sequences from different
classes far apart, and thus finds a structure in the input space
that provides the best classification result.
I. Introduction

Problem description: The biochemical material that carries
hereditary characteristics from parents to offspring is
contained in a chemical sequence known as
deoxyribonucleic acid (DNA). A gene consists of a
continuous  stretch  of  DNA  that  is  needed  to  produce  a 
particular  protein.  The  process  by  which  the  DNA  gives  rise 
to a protein is called gene expression. In a eukaryotic cell,
i.e., a cell that contains a nucleus, gene expression
involves the synthesis of pre-mRNA on the DNA templates
(transcription), removal of the non-coding regions (splicing)
from the pre-mRNA to form mRNA, and the synthesis of the
protein on the mRNA templates (translation). Due to
splicing,  a  DNA  sequence  consists  of  alternating  segments  of 
exon  and  intron,  where  an  exon  is  a  nucleotide  sequence  that 
is  expressed  or  translated  into  protein,  and  an  intron  is  an 
intervening  sequence  that  is  transcribed  into  RNA,  but  later 
eliminated  from  the  transcript  by  splicing  its  adjacent  exons 
(Fig.  1).  The  splice  junction  refers  to  the  point  where  the 
splicing  takes  place,  i.e.,  it  is  the  meeting  point  of  intron  and 
exon. 

Motivation: Localization of protein-coding regions in a DNA
sequence by purely biological means is a time-consuming and
costly procedure. Hence, many computational methods have
been attempted to recognize the splice junctions. To the best
of our knowledge, no paper has focused on the
representation issues of the splice junction problem. The
performance of any classification technique critically depends
on the representation. For instance, the DNA sequence can be
represented as symbols, binary numbers or real numbers. In
addition, we can exploit the influence of the neighbors or
some other characteristics of the sequence to form better
representation schemes.

Objective: This work conducts experiments to find an
appropriate representation for the splice junction
classification task. The task involves recognizing (a) exon/intron
boundaries, or “donor” sites, (b) intron/exon
boundaries, or “acceptor” sites, and (c) sequences with neither
boundary. To measure the generalization capability of the
classifier, we have used standard classification error measures.


Exon-Intron  junction  Intron-Exon  junction 


Fig. 1: The transcription and splicing of a gene in the nucleus of a
eukaryotic cell. Each circle in the DNA sequence represents a nucleotide,
i.e., any one of Adenine (A), Thymine (T), Cytosine (C) or Guanine (G).
Each circle in the pre-mRNA and mRNA is A, U (Uracil), C or G.

Scope: The data set, which we have collected from
[Blake:98], contains 3190 samples. Approximately 25% of the
samples have intron-exon boundaries, another 25% have
exon-intron boundaries, and the remaining 50% have no
boundaries. Each sample is a sequence of 60 nucleotides,
which we denote by F1 through F60, and the boundary (if any)
is exactly at the middle (Fig. 2). Each sequence belongs to
one of the three classes, and the aim is to identify the
midpoint of the sequence as being an exon-intron (EI)
boundary, an intron-exon (IE) boundary, or neither
boundary (N).

Issues: The aim of an appropriate representation is to form a
structure  in  the  input  space  such  that  two  sequences  from  the 
same  class  remain  in  the  close  vicinity  in  the  input  space,  and 
the  sequences  from  different  classes  remain  wide  apart.  This 
problem could be posed by considering the frequency of
occurrence of the nucleotides or by defining a certain distance
function in the input space. Both techniques
implicitly compress the information needed to represent the
input sequence. This is achieved by making the representations
similar for sequences of the same class and dissimilar for
sequences of different classes. The compression becomes more
effective when intrinsic properties of the sequence, such as the
influence of the neighbors and roughness, are reflected in the
representation.

Another major problem is the representation of the sequences
that have neither an exon/intron nor an intron/exon boundary.
Representing  patterns  for  this  class  is  difficult  since  (a)  the 
space  covered  by  this  class  is  very  large,  and  often  not 
enough  training  patterns  are  present  to  cover  such  a  large 
space,  and  (b)  this  class  and  the  other  two  classes  are 
overlapping. 

[Figure 2 diagram: a sequence of positions F1 ... F30 | F31 ... F60, with the splice junction marked between F30 and F31]

Fig. 2: Splice junction classification problem with the given set of data. Each
DNA sequence contains sixty nucleotides, and their positions are indicated from
the left by F1, F2, ..., F60. The boundary (if any) occurs between F30 and
F31.

II. Methodology

Representation 1: A purely symbolic approach is adopted in
this representation. The four possible nucleotides are
represented using different symbols, for instance A for
Adenine, T for Thymine, C for Cytosine and G for Guanine.
Hence, a sequence of length 60 is represented as a series of 60
symbols. We can construct a rule base of all possible ($4^{60}$)
if-then rules that relate any valid sequence of length 60 to the
output class. When a new sequence arises, its class label
can be determined using the class label of the
matching sequence in the rule base. However, constructing
and accessing such a large rule base is computationally
intractable. In contrast, if we construct a rule base with fewer
rules, then for some valid input sequence, no rule
may fire because there is no match between the input and the
rules. Hence, we need to decrease the number of rules without
compromising much on classification efficiency. In other
words, from a small set of rules the classifier needs to
generalize so that it can classify any sequence
with high classification efficiency. Generalization with
fewer rules is possible if some property of the
sequence is reflected in the representation so that it can be
subsequently captured by the classifier. To achieve that, we
separate the given DNA sequences into two sets: a training set
Tr and a testing set Ts ($Tr \cap Ts = \emptyset$). The training set Tr is
used for building the classifier, and the testing set Ts is used
to determine how efficient the resultant classifier is. The
resultant classifier can be used to classify any valid input
sequence. Below we describe some approaches along this
line.
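
The size of the exhaustive rule base and the disjoint split can be illustrated with a short sketch. The helper name `split_disjoint`, the default split fraction and the fixed random seed are assumptions made for illustration only; they are not taken from the experiments.

```python
import random

# One if-then rule per possible length-60 sequence over {A, T, C, G}.
RULE_BASE_SIZE = 4 ** 60  # roughly 1.3e36 rules


def split_disjoint(samples, train_fraction=0.5, seed=0):
    """Shuffle sample indices and return disjoint (training, testing) lists.

    Hypothetical helper: the fraction and the seed are illustrative choices.
    """
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * train_fraction)
    train = [samples[i] for i in indices[:cut]]
    test = [samples[i] for i in indices[cut:]]
    return train, test
```

Because every index lands in exactly one of the two lists, Tr and Ts cannot share a sample, which is the condition $Tr \cap Ts = \emptyset$.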

Representation 2: It has been observed that inside an intron, not
all triplets of nucleotides (called codons) appear with the
same probability. Specifically, the probability of occurrence
of a nucleotide in an intron is different for each codon
position. The exon does not have this property. Hence, this
property can be exploited to find the difference between the
intron region and the exon region. The three positions in the
codon are denoted by 0, 1 and 2. Hence any nucleotide can be
viewed as a member of the alphabet {A0, A1, A2, T0, T1, T2,
C0, C1, C2, G0, G1, G2}. A1 indicates that the nucleotide is
Adenine and it is at the second position of the codon. Using
this representation, the Jensen-Shannon divergence measure
[Galvan:00] is computed for the two halves (F1 to F30 and F31
to F60). The Jensen-Shannon divergence measure is supposed to
attain its maximum value at the point where two dissimilar
regions are merged. If the divergence measure crosses some
threshold, then the midpoint is considered a splice junction.
The appropriate value of the threshold is estimated from the
training set.
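
The computation for one sequence can be sketched as follows. This is a minimal illustration under stated assumptions: the codon position of a nucleotide is taken to be its index modulo 3 counted from F1 (the reading frame is not specified in the text), the divergence is the length-weighted Jensen-Shannon divergence in bits, and the function names are hypothetical.

```python
import math
from collections import Counter


def phase_symbols(seq):
    """Label each nucleotide with an assumed codon position, e.g. 'A1'."""
    return [f"{base}{i % 3}" for i, base in enumerate(seq)]


def entropy(counts):
    """Shannon entropy (in bits) of the distribution given by the counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c)


def js_divergence(left, right):
    """Length-weighted Jensen-Shannon divergence between two symbol lists."""
    c_left, c_right = Counter(left), Counter(right)
    n_left, n_right = len(left), len(right)
    w_left = n_left / (n_left + n_right)
    w_right = n_right / (n_left + n_right)
    # Adding the counters gives the counts of the merged region.
    merged = c_left + c_right
    return entropy(merged) - w_left * entropy(c_left) - w_right * entropy(c_right)


def has_junction(seq, threshold):
    """Compare the two halves and report whether the threshold is crossed."""
    symbols = phase_symbols(seq)
    mid = len(symbols) // 2
    return js_divergence(symbols[:mid], symbols[mid:]) > threshold
```

In practice the threshold passed to `has_junction` would be chosen on the training set, as described above.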

Representation  3:  Like  the  first  representation,  four  different 
symbols  are  used  here.  Here  the  focus  is  on  (a)  the  frequency 
of  occurrence  of  a  particular  nucleotide  at  a  particular 
position,  and  (b)  how  the  occurrence  of  that  nucleotide  is 
influenced by the previous nucleotides in the series. As a
classifier, we have used a hidden Markov model with five
states. We train the model using the Baum-Welch algorithm
[Bengio:99] on the training set.
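
How a trained hidden Markov model can score a sequence is sketched below with the scaled forward algorithm. The text does not state how the class decision is derived from the model, so the scheme shown here (one model per class, highest likelihood wins) is an assumption, as are the function names and the parameter layout. Training with Baum-Welch is not shown.

```python
import math


def forward_log_likelihood(seq, start, trans, emit):
    """Scaled forward algorithm: log P(seq | model) for a discrete HMM.

    start[s]       initial probability of state s
    trans[s][t]    probability of moving from state s to state t
    emit[s][sym]   probability that state s emits symbol sym
    """
    n_states = len(start)
    alpha = [start[s] * emit[s][seq[0]] for s in range(n_states)]
    log_lik = 0.0
    for symbol in seq[1:]:
        # Rescale to avoid underflow and accumulate the log of the scale.
        scale = sum(alpha)
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
        alpha = [
            emit[t][symbol] * sum(alpha[s] * trans[s][t] for s in range(n_states))
            for t in range(n_states)
        ]
    return log_lik + math.log(sum(alpha))


def classify(seq, models):
    """Assumed decision rule: pick the label whose model scores seq highest."""
    return max(models, key=lambda label: forward_log_likelihood(seq, *models[label]))
```

Here `models` maps a class label to a `(start, trans, emit)` triple; with five states, `start` has five entries and `trans` is a 5x5 table.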

Representation  4:  Here  we  are  interested  in  the  structure  of 
the  input  space.  In  this  representation,  each  nucleotide  is 
represented  as  [Towel:94] 

A = 0001, G = 0010, C = 0100 and T = 1000 (1)

Note that here each nucleotide occupies a corner of a
four-dimensional hypercube, and hence the distance between any
two nucleotides is constant. Another possible representation
is A = 00, T = 01, C = 10, G = 11. But this representation is
biased in the Euclidean distance sense since it
indicates that A is closer to T than to G. Hence, we have adopted
the representation of Equation (1). The sequence ATC is
represented as 0001 1000 0100. Thus, each DNA sequence of
the training and test sets is represented as a string of 240 zeros
and ones.
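
The encoding of Equation (1) and the distance argument can be checked with a short sketch. The names `ONE_HOT`, `TWO_BIT`, `encode` and `distance` are chosen here for illustration.

```python
import math

# Equation (1): each nucleotide is a corner of a four-dimensional hypercube.
ONE_HOT = {"A": (0, 0, 0, 1), "G": (0, 0, 1, 0), "C": (0, 1, 0, 0), "T": (1, 0, 0, 0)}
# The alternative two-bit code discussed in the text.
TWO_BIT = {"A": (0, 0), "T": (0, 1), "C": (1, 0), "G": (1, 1)}


def encode(seq, code=ONE_HOT):
    """Concatenate the code words of all nucleotides into one flat list."""
    return [bit for base in seq for bit in code[base]]


def distance(u, v):
    """Euclidean distance between two equally long code words."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Under `ONE_HOT` every pair of distinct nucleotides is at distance sqrt(2), whereas under `TWO_BIT` the distance from A to T is 1 and from A to G is sqrt(2), which is the bias the text refers to.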

We have applied a feedforward neural network with
backpropagation learning as a classifier [Jang:97]. The
network has three layers, and the numbers of nodes in the
first, second and third layers are 240, 10 and 3, respectively.
The classifier searches the 240-dimensional hyperspace to find a
structure in it. The classifier classifies a test
sequence according to the structure in which the test pattern falls.
Note that while forming the input space or while searching
the input space, we do not explicitly consider the interaction
between two neighboring nucleotides.
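
The shape of the 240-10-3 network can be sketched as a forward pass. The sigmoid activation, the weight initialization range, the bias handling and the order of the output classes are assumptions; the text specifies only the layer sizes and the use of backpropagation, and training is not shown here.

```python
import math
import random

CLASSES = ("IE", "EI", "N")  # assumed order of the three output nodes


def init_layer(n_in, n_out, rng):
    """One row per output node: n_in small random weights plus a zero bias."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] + [0.0] for _ in range(n_out)]


def layer_forward(weights, inputs):
    """Sigmoid of the weighted sum; the last entry of each row is the bias."""
    outputs = []
    for row in weights:
        z = sum(w * x for w, x in zip(row[:-1], inputs)) + row[-1]
        outputs.append(1.0 / (1.0 + math.exp(-z)))
    return outputs


def predict(network, bits):
    """Propagate a 240-bit input through the layers and pick the largest output."""
    activations = bits
    for weights in network:
        activations = layer_forward(weights, activations)
    return CLASSES[activations.index(max(activations))]


rng = random.Random(0)
network = [init_layer(240, 10, rng), init_layer(10, 3, rng)]
```

With untrained weights the prediction is arbitrary; the sketch only shows how a 240-bit string maps to one of the three class labels.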

Representation 5: In this representation, we express a DNA
sequence using the same format as in Equation (1). But, the
sequence is arranged as a 4x60 matrix with one column per
nucleotide. Hence, the sequence ATC is represented as

$$\begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \\ 1 & 0 & 0 \end{bmatrix} \qquad (2)$$

In addition, the sequence is considered similar to a time
series. It means the appearance of a nucleotide depends on the
previous nucleotides of the series.

We have constructed a time delay feedforward neural
network (with a four-unit time delay) that approximates the
sequences of Tr for a particular class [Plataniotis:96]. For the
three classes we have used three neural networks of the same
configuration: 20, 5 and 4 nodes in the first, hidden and
output layers, respectively. The networks act as predictors,
and their prediction error is used to train them. Let us assume
a sequence $x = [x_1 x_2 \ldots x_{60}]' \in Tr$ is from the class EI.
When a part of the sequence $x$ (say $x_1 x_2 \ldots x_{15}$) is fed, the
network predicts (say $o_{16}$) the nucleotide at the 16th position
based on the last four nucleotides (i.e. $x_{12} x_{13} x_{14} x_{15}$) in $x$.
Now the difference between $o_{16}$ and $x_{16}$ is found. This
difference acts as an error to train the network iteratively so
that the prediction becomes more accurate. Next the
difference between the network output $o_{17}$ and the actual
nucleotide value $x_{17}$ is calculated. Again, the difference is
used to train the network. This procedure is carried out for the
whole sequence $x$, and it is repeated for all the sequences of
the given class. Thus for the three classes, three predictors are
trained.

When a new sequence $x = [x_1 x_2 \ldots x_{60}]' \in Ts$ appears, it is
fed to all the predictors. Initially $x_1 x_2 x_3 x_4$ is fed to the
predictor for the class EI. It produces the output $o_5$. Now the
error is computed as $(x_5 - o_5)^2$. Following a similar
procedure, the total error by the predictor is
$\sum_{i=5}^{60} (x_i - o_i)^2$. Similarly the errors for the
predictors corresponding to the other classes are calculated.
The predictor that produces the least error is the model closest
to the sequence, and hence the class label of the closest
predictor is accepted as the class label of the test sequence.


Representation 6: This measure attempts to extract some
regularity from the apparent randomness inherent in the DNA
sequence. In the conventional random walk model, a walker
moves either up ($u_i = +1$) or down ($u_i = -1$) by one unit
length at the ith step of the walk. Following this concept, one
definition of the DNA walk is that the walker steps up if a
pyrimidine (C or T) occurs at the position i along the DNA
chain, while the walker steps down if a purine (A or G) occurs at
the ith position [Havlin:99]. Thus at the ith nucleotide the
position of the walker is $Y_i = \sum_{j=1}^{i} u_j$, and the sequence
appears as a time series. Note that in Representations 4 and
5, the sequence can have values only in {0, 1}; but now the
sequence can have any positive or negative discrete value.
The classification is carried out in the following steps: (a)
Represent all sequences in the training and test sets with the
DNA walk representation. (b) Find the mean sequence for
each class. For instance, the mean sequence for the class EI is

$$m_{EI} = \frac{\sum_{Y \text{ is from } Tr \text{ and class } EI} Y}{\text{no. of sequences in } Tr \text{ that are from class } EI}$$

(c) When a new test sequence $Y = [Y_1 Y_2 \ldots Y_{60}]'$ appears, we
find the mean sequence closest to it using the following
similarity measure: $S_{EI}(m_{EI}, Y) = \sum_{i=1}^{60} (m_{EI,i} - Y_i)^2$. (d) The
class label of $Y$ is the class label of the closest mean
sequence.
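
Steps (a) to (d) can be sketched as a nearest-mean classifier over DNA walks. The function names are illustrative, and the training sequences for each class are assumed to be available as plain strings.

```python
# Pyrimidines (C, T) step up, purines (A, G) step down.
STEP = {"C": 1, "T": 1, "A": -1, "G": -1}


def dna_walk(seq):
    """Cumulative position of the walker after each nucleotide."""
    walk, position = [], 0
    for base in seq:
        position += STEP[base]
        walk.append(position)
    return walk


def mean_walk(seqs):
    """Position-wise mean of the DNA walks of the given sequences."""
    walks = [dna_walk(s) for s in seqs]
    return [sum(column) / len(walks) for column in zip(*walks)]


def squared_distance(mean, walk):
    """Sum of squared differences between a mean walk and a test walk."""
    return sum((m - y) ** 2 for m, y in zip(mean, walk))


def classify(seq, class_means):
    """Label of the class whose mean walk is closest to the walk of seq."""
    walk = dna_walk(seq)
    return min(class_means, key=lambda label: squared_distance(class_means[label], walk))
```

Here `class_means` maps each class label to the output of `mean_walk` on the training sequences of that class.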

Using the DNA walk representation, we have plotted the
traces corresponding to all the three classes (Fig. 3). We can
observe that to some extent the lines representing the classes
IE and EI can be separated visually; however, even for a
human observer it is difficult to separate the lines
corresponding to the class N from the lines of the other two
classes. This suggests that a classifier using the DNA walk
representation is unlikely to achieve high classification
efficiency for the class N.


Fig. 3: The blue, red and green lines indicate the DNA walk for the genes
with intron/exon boundary, exon/intron boundary and neither, respectively. We can
observe that the blue and red lines can be separated better than the red and green
lines or the blue and green lines. Hence, while using the DNA walk representation,
the classifiers also cannot reliably classify the sequences of the class N.


Representation 7: In the DNA walk, it has been noticed that if
the ruggedness or irregularity of a part of the DNA walk is
scaled up, then the resultant ruggedness becomes similar to
the ruggedness or irregularity of the whole sequence
[Havlin:99]. This clue can act as the regularity in the DNA
walk, and thus it can be used to characterize the time series.
The Hurst exponent is intended to quantify this clue such that the
quantified values are relatively insensitive to translation,
scaling, noise and nonstationarity [Addision:97]. One popular
approach to estimating the Hurst exponent is dispersional
analysis. It requires the following three steps to be performed
on a sequence:

1. Partitioning: Each sequence is partitioned into equal
intervals of length $\delta_j$. Let us call the ith interval $W(i, \delta_j)$.

2. Single scale statistics: It can be of two types:

(a) Local statistics: Statistics based on the values of the DNA
walk within a single interval are extracted. The local statistic
in the interval $W(i, \delta_j)$ is the mean of the data in that
interval.

(b) Partition based statistics: The partition-based statistic
$T(\delta_j)$ is the standard deviation of the means. Next, the whole
process is repeated for several lengths of the interval $\delta_j$.

3. Transscale statistics: The transscale statistic is 1.0 plus
the slope of a linear regression that fits a plot of $\log(T(\delta_j))$
vs. $\log(\delta_j)$ for all j.

The Hurst exponents are extracted for the exon and intron regions
of each sequence in the training set. The extracted Hurst
exponents are used to train a two-class feedforward neural
network with backpropagation learning. Whenever a new
sequence appears, the Hurst exponents of its two halves are
determined, and then the Hurst exponent of each half is fed to
the neural network to identify the type of that half. The splice
junction can be classified easily once we know the identity of
each half.
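
The three steps of the dispersional analysis can be sketched as follows. The interval lengths, the use of the population standard deviation and the choice to skip scales whose standard deviation is zero are assumptions made for this illustration; the function name is hypothetical.

```python
import math


def _std(values):
    """Population standard deviation."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def hurst_dispersional(series, interval_lengths=(1, 2, 3, 5)):
    """Estimate the Hurst exponent as 1.0 plus the log-log regression slope."""
    xs, ys = [], []
    for length in interval_lengths:
        # Step 1: partition into equal intervals of the current length.
        n_intervals = len(series) // length
        # Step 2(a): local statistic, the mean of each interval.
        means = [
            sum(series[k * length:(k + 1) * length]) / length
            for k in range(n_intervals)
        ]
        # Step 2(b): partition-based statistic, the deviation of the means.
        sd = _std(means) if n_intervals >= 2 else 0.0
        if sd > 0:  # log(0) is undefined, so such scales are skipped
            xs.append(math.log(length))
            ys.append(math.log(sd))
    if len(xs) < 2:
        raise ValueError("need at least two usable interval lengths")
    # Step 3: least-squares slope of log(sd) against log(length).
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return 1.0 + slope
```

The default interval lengths all divide 30, so each half of a 60-nucleotide sequence is partitioned without leftover values.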

III.  Results  and  discussion 

In the data set, we have removed the sequences in which some
nucleotides have unknown values. Half of the available
sequences are used for training and the remaining half
for testing. We have not studied Representation 1 as it is
computationally intractable. The classification performances
using the remaining six representations are shown in Table 1.
We can observe that the fourth representation provides the
best result, even for the class N. This conclusion remained valid
when we performed experiments on three other large
data sets obtained from GenBank. It indicates that the
structure of the input space is important. Representation 4 is
able to form a better structure in the input space, and the
dependency on the neighbors is implicitly captured in
Representation 4.


TABLE 1

COMPARATIVE CLASSIFICATION PERFORMANCE WITH SIX REPRESENTATIONS. THE
CLASSES ARE INTRON-EXON (IE) BOUNDARY, EXON-INTRON (EI) BOUNDARY AND
NEITHER (N).

Representation    IE        EI        N         Overall
2                 34.34%    46.32%    61.32%    50.82%
3                 72.45%    78.56%    45.67%    60.58%
4                 85.16%    93.12%    70.45%    79.79%
5                 67.23%    76.35%    22.45%    47.12%
6                 82.67%    91.47%     9.32%    48.19%
7                 52.45%    54.37%    54.37%    53.89%

Note that (a) instead of neural networks or hidden Markov
models, other classifiers could be used, and in that case the
classification performances may vary, (b) for Representations
1, 4, 5 and 6, all training and testing sequences should be of
the same length, although for Representations 2, 3 and 7 this
constraint is not present, and (c) due to space limitations,
we could not discuss many other representations.

Future work includes (a) construction of a better
classifier using Representation 4, (b) extraction of
classification rules from the data set, (c) studying the
efficiency of Representation 4 when some DNA sequences
have missing values, and (d) developing a modular approach
involving all the representations.